Image spam filtering using textual and visual information

نویسندگان

  • Giorgio Fumera
  • Ignazio Pillai
  • Fabio Roli
  • Battista Biggio
چکیده

In this paper we focus on the so-called image spam, which consists in embedding the spam message into images attached to e-mails to circumvent statistical techniques based on the analysis of body text of e-mails (like the “bayesian filters”), and in applying content obscuring techniques to such images to make them unreadable by standard OCR systems without compromising human readability. We argue that a prominent role against image spam will be played by computer vision techniques, in particular visual pattern recognition and image processing techniques. We then discuss two possible approaches to defeat image spam: exploiting the high-level textual information embedded into images by combining OCR and text categorization techniques, and exploiting the low-level image information to detect content obscuring techniques applied to spam images. We also report some results of an experimental investigation on a large data set of spam e-mails, aimed at evaluating the effectiveness of combining standard OCR and text categorization techniques, and preliminary results on the use of low-level features to detect image defects (like broken characters or noise components interfering with characters in a binarized image) which are typical consequences of content obscuring techniques that spammers are using.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Trends in Combating Image Spam E-mails

With the rapid adoption of Internet as an easy way to communicate, the amount of unsolicited e-mails, known as spam e-mails, has been growing rapidly. The major problem of spam e-mails is the loss of productivity and a drain on IT resources. Today, we receive spam more rapidly than the legitimate e-mails. Initially, spam e-mails contained only textual messages which were easily detected by the ...

متن کامل

A Comparative Analysis of the Effect of Visual and Textual Information on the Health Information Perception of High School Girl Students in Tehran

Purpose: Information and information sources can be divided into three broad categories according to their nature or type: textual information (book, journal article, conference paper, dissertation, newspaper, etc.), visual information (infographic, photo, Cartoons, films, etc.) and audiovisual information. The purpose of this study is to determine the effect of reading textual information in c...

متن کامل

Detecting Image Spam Using Image Texture Features

Filtering image email spam is considered to be a challenging problem because spammers keep modifying the images being used in their campaigns by employing different obfuscation techniques. Therefore, preventing text recognition using Optical Character Recognition (OCR) tools and imposing additional challenges in filtering such type of spam. In this paper, we propose an image spam filtering tech...

متن کامل

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

This paper introduces our work in the TREC2005 SPAM track. Naïve Bayes and Littlestone’s Winnow are chosen as our basic classifiers. In our investigation, we found that when the structures of Ham and Spam are very different, the feature distributions of them vary a lot. Thus the factor of structure is introduced into our filter. Besides textual word feature, some kind of other features are also...

متن کامل

Embedded-Text Detection and Its Application to Anti-Spam Filtering

Embedded-Text Detection and Its Application to Anti-Spam Filtering Ching-Tung Wu Embedded-text in images usually carry important messages about the content. In the past, several algorithms have been proposed to detect text boxes in video frames. Previous work often followed a multi-step framework using a combination of image-analysis and machine-learning techniques. In this work, we propose a u...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007